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ABSTRACT 

Motivation: The construction of statistics for summarizing posterior 
samples returned by a Bayesian phylogenetic study has so far been 
hindered by the poor geometric insights available into the space of 
phylogenetic trees, and ad hoc methods such as the derivation of a 
consensus tree makeup for the ill-definition of the usual concepts of 
posterior mean, while bootstrap methods mitigate the absence of a 
sound concept of variance. Yielding satisfactory results with suffi- 
ciently concentrated posterior distributions, such methods fall short 
of providing a faithful summary of posterior distributions if the data 
do not offer compelling evidence for a single topology. 
Results: Building upon previous work of Billera et al., summary stat- 
istics such as sample mean, median and variance are defined as the 
geometric median, Frechet mean and variance, respectively. Their 
computation is enabled by recently published works, and embeds 
an algorithm for computing shortest paths in the space of trees. 
Studying the phylogeny of a set of plants, where several tree topolo- 
gies occur in the posterior sample, the posterior mean balances cor- 
rectly the contributions from the different topologies, where a 
consensus tree would be biased. Comparisons of the posterior 
mean, median and consensus trees with the ground truth using simu- 
lated data also reveals the benefits of a sound averaging method when 
reconstructing phylogenetic trees. 

Availability and implementation: We provide two independent im- 
plementations of the algorithm for computing Frechet means, geomet- 
ric medians and variances in the space of phylogenetic trees. 
TFBayes: https://github.com/pbenner/tfbayes, TrAP: https://github. 
com/bacak/TrAP. 

Contact: philipp.benner@mis.mpg.de 
1 INTRODUCTION 

Phylogenetic trees are central to the study of evolution, so much 
that the sketch of a tree of species by Sir Charles Darwin has 
become the icon of this theory. Nowadays, trees relating units of 
selection (be it functional domains, genes or species) are struc- 
tures of primary interest for systematists, but also instrumental 
to a wealth of other studies where evolutionary correlations need 
to be accounted for [see, for instance, McCue et al. (2001)]. 
Various statistical models pertaining to diverse types of observ- 
ables can be found in the literature, as well as methods, for 
estimating their parameters and reconstructing a phylogenetic 
tree (Gascuel, 2005). Some estimation methods proceed by max- 
imizing a posterior distribution or a likelihood function, and are 
amenable to an exact reconstruction of the optimal tree, but 
Bayesian phylogenetic analyses generally produce posterior dis- 
tributions that are best explored by generating posterior samples. 
While a large enough posterior sample offers a faithful 
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representation of the posterior knowledge, it is of little scientific 
interest unless summarized by some statistics (Robert, 2001). A 
summary can balance contributions from the different tree topol- 
ogies occurring in the sample, resulting in a legit phylogenetic 
tree, or combine them within a phylogenetic network. Here we 
focus on the former, showing how to build a phylogenetic tree 
that faithfully represents the sample in its entirety, despite com- 
peting topologies occur. 

Provided a unique topology with n edges occurs in the sample, 
each tree including its edge lengths can be identified by a point in 
the positive orthant of the Euclidean space MP . Performing an 
average of the sample in this linear representation is a straight- 
forward operation, which produces a legit posterior mean tree. If 
more than one tree topology occurs, the trees are no longer 
mapped all to the same linear space, and the posterior mean is 
ill-defined. Selecting the a posteriori most probable tree topology 
may seem a sound alternative, however, with the unpleasant 
consequence of neglecting all the sampled trees of different 
topology, and therefore would not provide a satisfactory repre- 
sentation of the posterior. The construction of a consensus tree, 
using an absolute majority rule (Margush and McMorris, 1981) 
to decide which one among competing edges should be retained, 
has been widely adopted by the interested community as the 
method of choice to summarize posterior samples of phylogen- 
etic trees. On the theoretical side, decision-theoretic justifications 
of this construction have been proposed (Holder et al, 2003; 
Huggins et aL,20\\). However, they are built upon loss functions 
that neglect edge lengths, focusing only on the tree topology. 
Besides, from the authors' point of view, it is also a rather 
conservative approach, as the absence of an absolute majority 
among edges results in the inclusion of none of them, thereby 
producing unresolved branching points. The extended majority- 
rule consensus method (also known as greedy consensus method) 
has been introduced to remedy this drawback by adding edges 
with <50% support (Bryant, 2003). Despite this improvement, 
the consensus methods neglect much of the available information 
in a sample by ignoring the context in which an edge occurs (i.e. 
the remaining topology of the tree as well as all other edge 
lengths). Reporting a posterior mean that balances the contribu- 
tions from each topology including edge lengths rather than 
isolated edges would therefore be of utmost interest. 

Building upon the work pubhshed by Billera et al. (2001), who 
deciphered the geometric structure of the space of phygenetic 
trees and first proposed a construction of the tree space (some- 
times also called BHV tree space, where BHV is an acronym of 
Billera, Holmes and Vogtmann), the purpose of this article is to 
show how the computation of the posterior mean of a sample of 
phylogenetic trees can be achieved by simply reaching out for the 
appropriate geometry. The BHV space is obtained by gluing 
together the positive orthants of the linear space associated to 
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each topology, so that a point in this space identifies both a tree 
topology and the lengths of the corresponding edges. The adja- 
cency structure between any two orthants reflects the edges 
shared by the two corresponding topologies and permits the 
definition of paths visiting several orthants. Any two trees are 
connected by at least one path, and the one with minimal length 
is called a geodesic. Therefore, the length of a geodesic qualifies 
as a distance function between phylogenetic trees, and offers a 
theoretically and practically appealing alternative to existing dis- 
tances (e.g. NNI or the Robinson-Foulds distance). 
Furthermore, using implicit characterizations of the posterior 
mean and median as minimizers of appropriate loss functions, 
algorithms developed by Bacak (2013, 2014); Miller et al. (2012); 
Sturm (2003) compute an approximation of these statistics by 
walking along geodesies. Here, the determination of the geo- 
desies is done in polynomial time, thanks to an algorithm 
owing to Owen and Provan (2011). 

This article gathers and combines technical results scattered 
across multiple mathematics papers, into a general statistical 
framework for analyzing posterior distributions over phylogen- 
etic trees easily accessible to the target readership, namely prac- 
titioners in Bayesian phylogenetics. Certainly, many other 
applications exist where computing an average phylogenetic 
tree is of great importance. The methods presented here are 
not restricted to Bayesian statistics, yet in this context, they 
allow to recover basic statistical concepts such as the posterior 
mean and median, as well as a notion of variance to measure the 
posterior uncertainty. 

After a gentle introduction to the geometry of the tree space in 
Section 2.1, the geometric median and Frechet mean over this 
space are constructed in Section 2.2. The algorithms computing 
those quantities are outlined in Section 2.3, while Section 3 shows 
the method in action, and illustrates how it compares with the 
majority- rule consensus method. 

2 METHODS 

Given a generative model and a prior distribution over its parameter 
space, a Bayesian analysis of observations carried across species related 
by evolution produces a posterior distribution over the space of all pos- 
sible phylogenetic trees for this set of species (Gascuel, 2005; Robert and 
Casella, 1999). The size of this space grows super-exponentially with the 
number of species, and it is often intractable to compute the normaliza- 
tion constant of this distribution. In such cases, sampling methods offer a 
way to explore the posterior distribution via an arbitrarily large sample 
drawn from it without requiring any further knowledge. However, al- 
though a posterior sample offers a representation of the full posterior 
distribution, it is of little scientific interest in absence of a method to 
summarize it. Building upon previous works by Billera et al. (2001), 
Miller et al. (2012), Owen and Provan (2011) and Bacak (2013, 2014), 
we propose here to define and compute posterior mean in a sound 
manner, an approach so far hindered by the poor geometrical insights 
into the space of phylogenetic trees (see also Gascuel, 2005). 

2.1 The geometry of the tree space 

The elucidation of the geometric structure of the space of phylogenetic 
trees is because of Billera, Holmes and Vogtmann (2001). For any integer 
/I > 3, a phylogenetic n-tree is a connected graph without cycle that has 
?i + 1 terminal vertices called leaves, which are labeled from 0 io n and 
associated with the n + \ species considered in the phylogenetic study. 



0 




1 2 3 5 6 

Fig. 1. An example of a 6-tree 

The non-terminal vertices of a tree bear no label, as they are seen as sheer 
'branching points'. An example of a 6-tree is shown in Figure 1. 

The construction of the tree space relies fundamentally on the identi- 
fication of edges of the tree as splits: an edge is uniquely associated to the 
partition of the set of leaves L = {0, . . . ,n] into two disjoint and non- 
empty subsets Li U L2=L that would be disconnected in the graph struc- 
ture by its removal. Such a split is denoted by Li\L2. For instance, the 
edges labeled ei, e2 and e^ of the tree in Figure 1 correspond to the splits 
(0,4, 5, 6| 1,2,3), (0, 1,2,314,5,6) and (0, 1, 2, 3, 4|5, 6), respectively. 
Conversely, given a set of leaves and splits subjects to certain conditions, 
a unique tree is specified. Namely, it is required that any two splits Li IL2 
and Li^\L2 are compatible, that is, one of the sets 

Li n L2, Li n L2, Li n l/, L2 n L2 

must be empty. 

The topology of a phylogenetic tree t can therefore be uniquely speci- 
fied via the associated set of compatible splits, which is denoted by S{t). 
Yet, a phylogenetic tree is more than a sheer topology, as its edges also 
have (positive) length. Writing \e\t for the length of the edge e^Li IL2 of 

one obtains a canonical mapping of the phylogenetic tree / onto 
by further setting \e'\t = ^ whenever e' = L\\L2 ^ S{t). 

While any phylogenetic tree over a given set of species can be repre- 
sented in this way in the linear cone + the converse is obviously not 
true: Assume that e and e' are two incompatible edges, any coordinates in 
the linear cone that have positive entries for e and e' do not correspond to 
any legit phylogenetic tree. In other words, the set of phylogenetic trees 
forms a manifold in the linear cone All the phylogenetic trees 

/ sharing a given topology also exhibit the same split set S, and are 
such that |e|; = 0 whenever e^S. The BHV tree space can therefore be 
understood as a collection of smaller dimensional positive orthants 
embedded jointly in [R^ each associated to a particular tree topology. 
Among these orthants, those of maximal dimension play a special role: a 
tree whose representation lies in their interior is binary, as such trees have 
the maximal possible number of edges. Shrinking the length of any edge 
down to zero results in the formation of a triple branching-point, so that 
the orthants associated to non-binary tree topologies appear as the faces 
of larger dimensional orthants. 

More interestingly, non-maximal orthants (those associated to non- 
binary tree topologies) are faces of several orthants simultaneously. The 
simplest instance is a triple-branching point, from which three different 
edges can be grown depending on which pair of species diverged first. As 
these three edges cannot be compatible, these three departures from the 
triple-branching point lie in different orthants, which are therefore all 
adjacent. Figure 2 shows a section of T4, where every tree in the interior 
of the orthant is a binary tree. For instance, the point (0, 1 /2) may be 
reached by taking a tree from the interior and shrinking the length of the 
edge e\ to zero. This location corresponds to a non-binary tree that lies at 
the face of three maximal orthants. 

As an example, take the space T3, which consists of all trees with only 
four leaves. Binary trees in this space have a single inner edge identified as 
one of the splits (0, 1|2, 3), (0, 2|1, 3) or (0, 3|1, 2). An orthant [0, 00) is 
associated with each of the splits. The origin 0 is a face of each orthant. 
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Fig. 2. 4-trees of a given combinatorial structure. The horizontal direc- 
tion shows the length of e\ : (0, 3, 4|1, 2), whereas the vertical direction 
shows ^2 : (0,4 1 1,2, 3) 



which represents the same non-binary tree. A piece of T3 is constructed 
by gluing all three orthants together at this common point. 

The tree space, as a compound of orthants, is not a convex subset of 
while one can form a linear interpolation between two trees of 
different topologies in the embedding space + by simply shrinking 
the edges to be removed and simultaneously growing those to be created, 
the cooccurrence of incompatible edges along this path places it outside 
of the tree space. However, as seen above, the orthants composing the 
tree space have a rich adjacency structure, which guarantees at least the 
path-connectedness of the space: any two phylogenetic trees can be con- 
nected by at least one path that remains in this space, although this path 
might not be a straight line as in whenever the topologies differ. 

Yet, these paths have a length, and one can consider the shortest path 
connecting two trees. In geometric terms, such a path is called a geodesic. 
Defining a distance over the tree space as the length of the shortest path 
connecting any two points equips the space with a metric d(-, •) which, as 
an alternative to the NNI or Robinson-Foulds distances, allows to meas- 
ure the discrepancy between any two trees, whatever their topologies. 
Although the technical details are not important for this article, it 
should be noted that the BHV tree space is a Hadamard space (Billera 
et al., 2001), which allows to use tools from this mathematical field. 

2.2 Mean, median and variance of a sample of 
phylogenetic trees 

Let us now exploit the above geometric properties of the tree space to 
summarize a sample of phylogenetic trees by a single point. Following the 
rationale of decision theory (e.g. Robert, 2001), the construction begins 
with the definition of a loss function, which measures how faithful a rep- 
resentation of the sample would be achieved by a given point in the tree 
space. A loss function is defined as the cost C{t, f) of choosing a phylo- 
genetic tree f instead of some other and the decision theory literature 
advocates strongly to summarize a posterior distribution by choosing t as 
the minimizer of the expected loss function 

F= argmin / £(/, /)/Xj|xd/ , (1) 

where jx^^x denotes the posterior distribution over phylogenetic trees 
given the data X. Approximating the latter via a posterior sample of 
phylogenetic trees /i , . . . , Z^^, the above formula becomes 

1 ^ 

t = arg min — C{ti, t') . 

Two very typical choices for the loss function are the distance and the 
squared distance. When the parameters to be estimated lie in a Euclidean 



space, it is well-known that the resulting estimates coincide respectively 
with the median and mean of the posterior distribution. Although the tree 
space is not Euclidean, distances between pairs of trees are well defined, 
and a minimizer of (Equation 1) can be sought, respectively yielding the 
so-called geometric median and Frechet mean. 

In contrast to other approaches that provide a decision-theoretic ar- 
gument for point estimates of phylogenetic trees (e.g. Holder et al, 2003; 
Huggins et al., 2011), the loss function considered here derives the intrin- 
sic metric of the underlying space. In particular, the loss function con- 
siders both the topology and the branch lengths of phylogenetic trees, as 
opposed to those supporting the consensus method, and thereby con- 
siders all available information in a sample. Unfortunately, in tree 
space, a simple gradient search is not a practical method to solve such 
optimization problems (see Miller et al, 2012). Therefore, appropriate 
algorithms are required and will be presented in Section 2.3. 

A side benefit of the method presented here is the sound definition of 
the sample variance, also called the Frechet variance, which is simply 
given as the value of the minimization problem with the squared distance. 
In complement to the point estimate, this quantity provides the modeler 
with insight onto the reliability of the point estimate. It is noteworthy that 
existing phylogenetic reconstruction methods are not tied to a notion of 
variance, and often retort to bootstrapping methods for reporting similar 
information. 

2.3 Computing the sample mean, median and variance 

The question of how to compute medians and means of a given set of trees 
will be addressed in the following. It turns out that efficient approxima- 
tion methods from optimization theory can be extended into Hadamard 
spaces and applied to median and mean computations. For the reader's 
convenience, the simple version for unweighted medians and means is 
outlined here (see also Bacak, 2013). 

2.3.1 Algorithm for computing medians Let us first describe the 
algorithm for computing a median of a given set of trees 
~t = {t\, . . . ,tK] (i.e. the set of all tree samples). 

Set xo = ?! and suppose that at the z-th iteration, an approximation 
G T„ of the median of 7 is available. To find x/+ 1 , a tree tk is selected from 
the set of trees 7 at random and x/+ 1 is defined as a point on the geodesic 
between x,- and tk- (In other words, x/+ 1 is a convex combination of x, and 
tk, which will be denoted (1 — X)xi + Xtk for some X e [0, 1].) The position 
of X/+ 1 on this geodesic is determined by a parameter r]^ e [0, 1], which is 
computed at each iteration. By this procedure, we obtain a sequence of 
trees xi , X2, . . ., which is known to converge to a median of 7. 

Algorithm 2.1 (Computing median). Let xo = t\. At each step i e Mq, 
choose randomly r, e {I, . . . , K} according to the uniform distribution and 
put 

Xf+i =(1 - r]i)Xi + r]it,.., 

with r]i defined by r]j = minj 1 , (7+-j^^^— ^| , for each i eNq, where d is the 
metric on T„. 

2.3.2 Algorithm for computing means Computing the mean is simi- 
lar to the computation of the median. As a matter of fact, it only differs in 
the coefficients determining the position of x/+i on the geodesic from x,- 
to tk. 

Algorithm 2.2 (Computing mean). Let xq = tj and, at each step, i e Nq, 
choose randomly rj e {I, . . . , K} according to the uniform distribution and 
put 
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The approximation algorithms for computing medians and means use 
(at each step) the algorithm for finding a geodesic in tree space by Owen 
and Pro van (2011). 



3 RESULTS 

The content of this section is intended to illustrate the behavior 
of the posterior mean and median, in comparison with the ma- 
jority-rule consensus tree, which is, for instance, computed by 
MrBayes (Huelsenbeck and Ronquist, 2001). Following a formal 
argument that relates these two estimates when extremely large 
or little information is available, the posterior distributions 
derived from real and simulated datasets are investigated in a 
way that best illustrates the different outcomes yielded by the 
existing and the proposed approaches. A prerequisite is the adop- 
tion of a specific statistical model, which is outlined first. 

3.1 Consensus versus posterior mean 

The majority- rule consensus method is a reference method to 
summarize samples from a posterior distribution. There, the con- 
sensus tree consists of those splits that occur in >50% of the 
samples. The average length of a retained edge is computed 
using the sub sample where the corresponding split does occur, 
thereby neglecting a fraction of the posterior mass, but also the 
context in which the split occurs. In contrast, the Frechet mean 
and geometric median account for the full posterior, and are 
expected to provide a more meaningful summary. However, 
both estimates have a property called stickiness (see Miller 
et al., 2012): If there is a high posterior uncertainty on the 
topology, the Frechet mean and geometric median result in 
non-binary trees, a behavior that parallels the multiple branching 
points reconstructed by the consensus tree when no absolute 
majority occurs. 

Take for instance the space T3 that consists of three orthants 
[0, 00) glued together at 0, as discussed earher, and place a tree 
on each orthant. If all three trees are equally far apart, say at a 
distance r to the origin, then obviously the Frechet mean lies at 
the origin. The term stickiness refers to the fact that the mean 
stays at the origin if one of the trees is moved further away. In 
fact, one tree may be located at a distance anywhere between 
r and 2r away from the origin without affecting the mean. 
Instead of moving one tree further away from the origin, one 
may similarly add another tree somewhere between the three 
trees, and the Frechet mean would again stay at the origin. 

A probabihstic counterpart of this phenomenon can be 
observed in the same setting. Equip T 3 with a distribution 
whose trace in each orthant is a normalized Gaussian distribu- 
tion, centered at identical distance from the origin, and truncated 
at 0. By symmetry the Frechet mean is at the origin, and one can 
ask how far the location parameter m of one component can be 
perturbed without affecting the mean. In T3, m is just a scalar, 
and one can study the distance of the Frechet mean t to the 
origin 0 as a function of m. An analytic but complicated solution 
of the distance d(Q, t) exists; however, a fairly good answer is 
provided by the following approximation: 



40, t) ^ maxio, 



m-^/2l{m)-2^{\) \ 
1+205(1) J 



, m > 1 



where $ is the standard normal distribution function. The 
Frechet mean stays at the origin until m reaches ~2.16, which 
roughly matches the case of only three trees. Also in this case one 
may similarly increase the probability mass on one orthant and 
the Frechet mean would stay at the origin until a certain thresh- 
old is reached. 

3.2 Statistical model 

The statistical model we use in this study is often used for the 
analyses of motifs of transcription factor binding sites. It became 
popular because of its analytic tractability (cf. Siddharthan et al, 
2005), and it is simple enough so that the marginal likelihood of 
an alignment, given a phylogenetic tree can be computed analyt- 
ically. In particular, it permits Bayesian model selection, or to 
gain some insights on how well an estimate generalizes to new 
data, and therefore qualifies perfectly for the purpose of compar- 
ing different downstream methods for summarizing the posterior 
samples. 

A dataset consists of an ahgnment of ^ + 1 homologous se- 
quences. The observations within each column of the alignment 
are assumed to evolve according to a phylogenetic tree t. The 
topology of the phylogenetic tree is a priori uniformly distribu- 
ted, and the edge lengths of t are drawn from a gamma distri- 
bution. The process of evolution is specified by a mutation 
model, which is parameterized by a column- specific stationary 
distribution 0. It follows a priori a Dirichlet distributed with 
pseudocounts a. It is sufficient to discuss the model for a 
single column of the ahgnment. Binary trees have In — \ 
nodes, each of which is associated with a random variable X^^^ 
that takes values in an alphabet A. Assume that node k is the 
parent of node /. The mutation process along the edge between 
the two nodes is defined as 

= X, 0 = i> - ;7m, Categorical(i>) + p^6^ , 

which was introduced by Felsenstein (1981). The probability of a 
mutation from node k to node / is denoted pm^ and depends on 
the length / of the edge that connects the two nodes, i.e. 
PMi = 1 — exp (— /). Furthermore, pj^ denotes the probability of 
no mutation, given as pj^ = 1 — pMr 

To obtain samples from the posterior distribution, we imple- 
mented a Metropohs coupled MCMC algorithm (Geyer and 
Thompson, 1995) for the given model. In the sequel, these sam- 
ples will serve as input to the reconstrution of the posterior mean, 
median, and consensus trees. 

3.3 Estimation results 

Using a multiple sequence alignment from a study by Karol et al. 
(2001), which was shghtly modified by Yang and Rannala 
(2005), the phylogeny of the small sub unit rRNA gene (SSU 
rRNA) from the nuclear genome of eight land plants and six 
charales (see Fig. 3) has been reconstructed. It appears that the 
edge that separates Psilotum nudum and Dicksonia antarctica 
from the remaining tree has a very short length of ^^0.0027. 
Figure 4 shows the marginal posterior distribution of this edge 
(^i) and a competing one (^2) that groups P.nudum with Taxus 
baccata and Arabidopsis thaliana. There remains a high posterior 
uncertainty about the exact topology of the tree at this very 
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Tolypella int prolifera 
Nitella opaca 

- Marchantia polymorpha 

- Sphagnum palustre 
'Anthoceros formosae 

- Huperzia lucidula 
' Dicksonia antarctica 

- Psilotum nudum 

-Arabidopsis thaliana 
- Taxus baccata 
Lamprothamnium macropogon 
Chara connivens 
Nitellopsis obtusa 
Lychnothamnus barbatus 

Fig. 3. Frechet mean estimated from the small subunit rRNA gene (SSU 
rRNA) from the nuclear genome of eight land plants and six charales. 
Edge lengths are plotted in horizontal direction only 
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Fig. 4. Marginal posterior density estimate of two edges ei and 62- The 
edge Ci groups P.nudum with D.antarctica, while 62 groups P.nudum with 
T. baccata and A. thaliana. The Frechet mean is shown as a vertical line 



branching. The posterior mass on e\ is, however, sufficient for 
the majority- rule consensus tree to include this edge. However, 
there, it has a much longer length than in the posterior mean tree 
(by '^0.01 1) because this length is obtained by averaging only the 
lengths of this edge when it occurs in the sample, neglecting the 
contributions of the alternative edges. Clearly, the shorter edge 
length of the posterior mean tree better accounts for the uncer- 
tainty, and the consensus tree appears, in contrast, to have over- 
estimated branch lengths. 

The assessment of reconstruction methods for phylogenetic 
trees is notoriously hindered by the ignorance of the true evolu- 
tionary history to be uncovered, as the latter is never observed. 
Instead, the estimated tree of Figure 3 has been used to generate 
50 ahgnments of length m = 50, 100, 250 and 500. For each 
generated dataset, 210000 posterior samples were obtained 
using one cold Markov chain and three heated chains. The 
Frechet mean, geometric median and consensus tree of the last 
200 000 samples were computed. In the whole study, tree topol- 
ogies are uniformly distributed a priori, while branch lengths are 
distributed according to a Gamma(l, 0.4). 

Figure 5 shows the distances of the computed estimates to the 
generating tree. For alignments of the lengths considered, the 
Frechet mean and geometric median are generally closer to the 
generating tree. A trend appears, from the greatest discrepancy 
observed for the shortest alignments, to an almost systematic 
agreement for the longest ahgnments. It should be noted that 
even shorter alignments generally result in very broad a posterior 
distributions so that all three estimates coincide with a star tree. 
At the opposite extreme, large datasets support a clear decision 
about the topology of the tree, placing most of the mass of the 
posterior distribution in a single orthant, and resulting in mostly 
agreeing estimates. One also observes that the geometric median 
is in most cases closer to the generating tree than the Frechet 
mean. This comes at no surprise given the skewness of the 
gamma prior on the branch lengths. 

The model permits an analytical computation of the marginal 
likelihood of an ahgnment given a phylogenetic tree, thereby 
offering an evaluation of how the estimated model generalizes 
to novel observations. Using a leave-one-out approach, the aver- 
age (unnormalized) posterior value achieved by the estimators 
was computed on the remaining 49 datasets of the same length 
(see Table 1). For all ahgnment lengths, the Frechet mean and 
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Fig. 5. Distances d(-, •) of the Frechet mean fi, geometric median F2 and 
consensus tree F3 to the generating tree / for alignments of length 50 (a), 
100 (b), 250 (c) and 500 (d). The straight line shows the main diagonal 



geometric median show a slightly higher average posterior value 
compared with the majority-rule consensus tree. But the differ- 
ence is too minor to make any definite statements, as also shown 
by the variance of the estimates. Interestingly, however, while the 
variance of the mean and median steadily decreases with data 
length, the consensus estimate shows an increase in variance for a 
data length of 250. A much clearer picture is gained by consider- 
ing how often the mean and median have a higher posterior 
value than the consensus (see Table 1). The results show that 
the consensus tree clearly perfomes worse. 

Another quantity of interest is the Frechet variance of the pos- 
terior distribution, which provides us with a measure of uncer- 
tainty. The mean variance is shown in Figure 6 separately for all 
four dataset lengths. Similar to the case of normal distributed i.i.d. 
random variables, the variance decreases approximately with 
\/m. Another, maybe more intuitive statistic, is to compute a 
credibihty region around the Frechet mean t that contains 
a given proportion c of the posterior mass. More precisely, 
consider the set of trees B={t eT,^ \ d(t, t) < ct] for some ^/* 
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Table 1. (a) Mean posterior values for the Frechet mean, geometric median and consensus tree (the variance is shown in 
brackets) and (b) percentage of times the mean and median show a higher posterior value on the remaining (joined) 49 
datasets 



(a) Posterior (b) Performance 



Mean Median Consensus Mean Median 



50 


-202.37 (11.23) 


-201.76 (11.29) 


-203.22 (10.10) 


0.76 


0.78 


100 


-409.30 (5.97) 


-409.04 (5.88) 


-409.39 (3.66) 


0.66 


0.64 


250 


-1035.66 (5.12) 


-1035.62 (5.32) 


-1036.72 (5.02) 


0.90 


0.90 


500 


-2074.11 (3.75) 


-2074.01 (3.75) 


-2074.96 (3.99) 


0.78 


0.80 



Note: Both statistics were evaluated separately on datasets of length m = 50, 100, 250 and 500. 
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Fig. 6. Average Frechet variance and credibility radius <i* for datasets of 
length 50, 100, 250 and 500. The error bars show 1 SD 



such that f^ijit\x^t = c. The bound d* may be called the credibility 
radius. Figure 6 shows the results for c = 0.68. 



4 DISCUSSION 

By recognizing the global geometric nature of the space of phylo- 
genetic trees, this article shows that the fundamental statistical 
notions defined over linear spaces, such as sample mean, median 
and variance, can be generalized to more complex spaces such as 
the tree space. Besides the sheer recovery of well-defmed funda- 
mental statistical quantities in the particular setting of phylogen- 
etic studies, this study also demonstrates critical differences in the 
behavior of the posterior mean and the consensus tree. 

The reconstruction of a consensus tree retains splits that occur 
in at least half of the samples. This absolute majority rule 
prevents the introduction of splits favored by sheer fluctuations, 
but also aims to maximize the information extracted from the 
sample. The length of the retained edges is indeed often simply 
set to the average length of their occurrences in the sample, so 
that the lengths of the discarded edges never enter the determin- 
ation of the consensus tree. As illustrated on a real dataset by 
Figure 4, the neglection of a fraction of the sample results in 
biased estunates, where edge lengths are systematically overesti- 
mated from the perspective of the geometry of the tree space. 

The extent of the bias born by the consensus tree is however 
tightly related to the concentration of the posterior distribution. 



which decreases the amount of information dropped in the re- 
construction process, and the simulation-based study shown 
above confirms that the consensus tree and the posterior average 
disagree mostly when there exists no compelhng evidence for a 
single topology. Illustrated on small datasets, the consensus tree 
appears dramatically further of the generating tree than the pos- 
terior mean, as a result of its neglection of a fraction of the 
information brought by the sample. 

The proper definition of a variance for a sample of phylogen- 
etic trees has consequences that should not be overlooked, and is 
believed by the authors to bear even more potential for apphca- 
tions. Not only is the reporting of the credibihty of the Bayesian 
estimate made simple by this quantity, but it also opens the way 
to the generalization of variance-based studies of samples of 
phylogenetic trees, including principal components analysis, a 
task already tackled by Nye (2011). Measuring the spread of a 
set of trees is a useful tool not only to quantify the posterior 
uncertainty. For instance, in a recent study Salichos et al. (2014) 
developed an information theoretic measure to quantify the in- 
congruence of gene trees. 
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